Import pandas and NumPy for data manipulation, and matplotlib for plotting.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
At first I read the file with low_memory=False to avoid mixed-type inference, but that increased memory usage, so I loaded the data with the defaults and used info() to summarize the DataFrame.
#metro = pd.read_csv("datasets/metro.csv", low_memory=False)
metro = pd.read_csv("datasets/metro.csv")
# Take an overview of the data
metro.info()
# Check missing data
metro.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92124 entries, 0 to 92123
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   trip_id              92124 non-null  int64
 1   duration             92124 non-null  int64
 2   start_time           92124 non-null  object
 3   end_time             92124 non-null  object
 4   start_station        92124 non-null  int64
 5   start_lat            89985 non-null  float64
 6   start_lon            89985 non-null  float64
 7   end_station          92124 non-null  int64
 8   end_lat              88052 non-null  float64
 9   end_lon              88052 non-null  float64
 10  bike_id              92124 non-null  object
 11  plan_duration        92124 non-null  int64
 12  trip_route_category  92124 non-null  object
 13  passholder_type      92124 non-null  object
 14  bike_type            92124 non-null  object
dtypes: float64(4), int64(5), object(6)
memory usage: 10.5+ MB
/var/folders/q7/8l57xp494s5832pk1vz3_4tc0000gn/T/ipykernel_59877/1147889363.py:2: DtypeWarning: Columns (10) have mixed types. Specify dtype option on import or set low_memory=False.
metro = pd.read_csv("datasets/metro.csv")
trip_id                   0
duration                  0
start_time                0
end_time                  0
start_station             0
start_lat              2139
start_lon              2139
end_station               0
end_lat                4072
end_lon                4072
bike_id                   0
plan_duration             0
trip_route_category       0
passholder_type           0
bike_type                 0
dtype: int64
The DtypeWarning indicates that the bike_id column contains mixed types; checking the type of each value confirms this:
metro.bike_id.apply(type).value_counts()
bike_id
<class 'str'>    65536
<class 'int'>    26588
Name: count, dtype: int64
Numbers can always be converted to strings, but not every string converts back to a number, so ID-like data is best stored as strings.
When reading the data, specify the dtype of bike_id and trip_id as string; trip_id is a unique identifier rather than a quantity, so it should be treated as a string.
To simplify the dataset, the latitude, longitude, and bike_id columns will be removed later. Convert start_time and end_time to datetime format for easier manipulation.
dtype_spec = {"bike_id": "string","trip_id": "string"}
metro = pd.read_csv("datasets/metro.csv", dtype = dtype_spec)
metro['start_time'] = pd.to_datetime(metro['start_time'])
metro['end_time'] = pd.to_datetime(metro['end_time'])
metro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92124 entries, 0 to 92123
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   trip_id              92124 non-null  string
 1   duration             92124 non-null  int64
 2   start_time           92124 non-null  datetime64[ns]
 3   end_time             92124 non-null  datetime64[ns]
 4   start_station        92124 non-null  int64
 5   start_lat            89985 non-null  float64
 6   start_lon            89985 non-null  float64
 7   end_station          92124 non-null  int64
 8   end_lat              88052 non-null  float64
 9   end_lon              88052 non-null  float64
 10  bike_id              92124 non-null  string
 11  plan_duration        92124 non-null  int64
 12  trip_route_category  92124 non-null  object
 13  passholder_type     92124 non-null  object
 14  bike_type            92124 non-null  object
dtypes: datetime64[ns](2), float64(4), int64(4), object(3), string(2)
memory usage: 10.5+ MB
Converting start_time and end_time to datetime gives better control over the time data. During the data cleaning stage, columns that are less relevant to the analysis (latitude, longitude, bike_id, and trip_id) will be dropped, in line with the data source documentation.
metro.describe()
| duration | start_time | end_time | start_station | start_lat | start_lon | end_station | end_lat | end_lon | plan_duration | |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 92124.000000 | 92124 | 92124 | 92124.000000 | 89985.000000 | 89985.000000 | 92124.000000 | 88052.000000 | 88052.000000 | 92124.000000 |
| mean | 33.168588 | 2019-08-17 07:30:19.200208640 | 2019-08-17 08:13:36.216621056 | 3484.899690 | 34.034786 | -118.287893 | 3480.271026 | 34.034895 | -118.286699 | 60.290977 |
| min | 1.000000 | 2019-07-01 00:04:00 | 2019-07-01 00:09:00 | 3000.000000 | 33.710979 | -118.495422 | 3000.000000 | 33.710979 | -118.495422 | 1.000000 |
| 25% | 6.000000 | 2019-07-26 06:13:15 | 2019-07-26 07:08:15 | 3029.000000 | 34.035801 | -118.281181 | 3028.000000 | 34.037048 | -118.280952 | 1.000000 |
| 50% | 12.000000 | 2019-08-18 11:05:00 | 2019-08-18 12:54:00 | 3062.000000 | 34.046810 | -118.258537 | 3062.000000 | 34.046810 | -118.258537 | 30.000000 |
| 75% | 22.000000 | 2019-09-08 16:21:00 | 2019-09-08 16:54:15 | 4285.000000 | 34.051941 | -118.248253 | 4285.000000 | 34.051941 | -118.248253 | 30.000000 |
| max | 1440.000000 | 2019-09-30 23:58:00 | 2019-10-02 12:04:00 | 4453.000000 | 34.177662 | -118.231277 | 4453.000000 | 34.177662 | -118.231277 | 999.000000 |
| std | 129.057841 | NaN | NaN | 611.483883 | 0.058803 | 0.073501 | 609.942741 | 0.058790 | 0.072628 | 111.141364 |
The table displays statistics for the time and duration columns, which reveal useful information such as the mean, minimum, maximum, and quartiles. However, the station columns hold categorical values encoded as numbers, so including them in numeric summaries is misleading; they should be converted to strings.
# transform stations to string
metro['start_station'] = metro['start_station'].astype(str)
metro['end_station'] = metro['end_station'].astype(str)
metro.info()
# the statistics now include numeric data only
metro.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92124 entries, 0 to 92123
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   trip_id              92124 non-null  string
 1   duration             92124 non-null  int64
 2   start_time           92124 non-null  datetime64[ns]
 3   end_time             92124 non-null  datetime64[ns]
 4   start_station        92124 non-null  object
 5   start_lat            89985 non-null  float64
 6   start_lon            89985 non-null  float64
 7   end_station          92124 non-null  object
 8   end_lat              88052 non-null  float64
 9   end_lon              88052 non-null  float64
 10  bike_id              92124 non-null  string
 11  plan_duration        92124 non-null  int64
 12  trip_route_category  92124 non-null  object
 13  passholder_type     92124 non-null  object
 14  bike_type            92124 non-null  object
dtypes: datetime64[ns](2), float64(4), int64(2), object(5), string(2)
memory usage: 10.5+ MB
| duration | start_time | end_time | start_lat | start_lon | end_lat | end_lon | plan_duration | |
|---|---|---|---|---|---|---|---|---|
| count | 92124.000000 | 92124 | 92124 | 89985.000000 | 89985.000000 | 88052.000000 | 88052.000000 | 92124.000000 |
| mean | 33.168588 | 2019-08-17 07:30:19.200208640 | 2019-08-17 08:13:36.216621056 | 34.034786 | -118.287893 | 34.034895 | -118.286699 | 60.290977 |
| min | 1.000000 | 2019-07-01 00:04:00 | 2019-07-01 00:09:00 | 33.710979 | -118.495422 | 33.710979 | -118.495422 | 1.000000 |
| 25% | 6.000000 | 2019-07-26 06:13:15 | 2019-07-26 07:08:15 | 34.035801 | -118.281181 | 34.037048 | -118.280952 | 1.000000 |
| 50% | 12.000000 | 2019-08-18 11:05:00 | 2019-08-18 12:54:00 | 34.046810 | -118.258537 | 34.046810 | -118.258537 | 30.000000 |
| 75% | 22.000000 | 2019-09-08 16:21:00 | 2019-09-08 16:54:15 | 34.051941 | -118.248253 | 34.051941 | -118.248253 | 30.000000 |
| max | 1440.000000 | 2019-09-30 23:58:00 | 2019-10-02 12:04:00 | 34.177662 | -118.231277 | 34.177662 | -118.231277 | 999.000000 |
| std | 129.057841 | NaN | NaN | 0.058803 | 0.073501 | 0.058790 | 0.072628 | 111.141364 |
Although the table is now clearer, the latitude and longitude columns are still meaningless in the statistics; they will be handled later.
metro.head()
| trip_id | duration | start_time | end_time | start_station | start_lat | start_lon | end_station | end_lat | end_lon | bike_id | plan_duration | trip_route_category | passholder_type | bike_type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 124657107 | 5 | 2019-07-01 00:04:00 | 2019-07-01 00:09:00 | 4312 | 34.066990 | -118.290878 | 4410 | 34.063351 | -118.296799 | 6168 | 30 | One Way | Monthly Pass | standard |
| 1 | 124657587 | 9 | 2019-07-01 00:07:00 | 2019-07-01 00:16:00 | 3066 | 34.063389 | -118.236160 | 3066 | 34.063389 | -118.236160 | 17584 | 30 | Round Trip | Monthly Pass | electric |
| 2 | 124658068 | 5 | 2019-07-01 00:20:00 | 2019-07-01 00:25:00 | 4410 | 34.063351 | -118.296799 | 4312 | 34.066990 | -118.290878 | 18920 | 30 | One Way | Monthly Pass | electric |
| 3 | 124659747 | 20 | 2019-07-01 00:44:00 | 2019-07-01 01:04:00 | 3045 | 34.028511 | -118.256668 | 4275 | 34.012520 | -118.285896 | 6016 | 1 | One Way | Walk-up | standard |
| 4 | 124660227 | 27 | 2019-07-01 00:44:00 | 2019-07-01 01:11:00 | 3035 | 34.048401 | -118.260948 | 3049 | 34.056969 | -118.253593 | 5867 | 30 | One Way | Monthly Pass | standard |
Find distinct values for plan_duration, trip_route_category, passholder_type, and bike_type.
# Plot the value counts for each categorical attribute. Taking the tick
# labels from value_counts() keeps bars and labels aligned (unique() can
# return the categories in a different order than value_counts()).
plt.figure(figsize=(16, 8))
categorical_columns = ["plan_duration", "trip_route_category", "passholder_type", "bike_type"]
for i, col in enumerate(categorical_columns, start=1):
    plt.subplot(2, 2, i)
    counts = metro[col].value_counts()
    y_pos = np.arange(len(counts))
    plt.bar(y_pos, counts.values)
    plt.xticks(y_pos, counts.index)
    plt.xlabel(col.replace("_", " ").title())
    plt.ylabel('Count')
plt.show()
The bar plots show the proportion of each categorical attribute's values, giving a first look at the consumption trends.
However, the counts for "Flex Pass" and "Testing" are so small relative to the other categories that they are barely visible on the bars.
print(metro.groupby('passholder_type')['duration'].size())
passholder_type
Annual Pass      6220
Flex Pass           6
Monthly Pass    57175
One Day Pass     5175
Testing            46
Walk-up         23502
Name: duration, dtype: int64
passholder_count = metro["passholder_type"].value_counts()
y_pos = np.arange(len(passholder_count))
plt.bar(y_pos, passholder_count.values)
plt.xticks(y_pos, passholder_count.index)
plt.xlabel('Passholder Type')
plt.ylabel('Count')
plt.yscale('log')  # log scale makes the tiny "Flex Pass" and "Testing" bars visible
plt.show()
The data includes "Testing" rows, which should be excluded from the analysis; they will be removed in the data cleaning stage.
metro.duplicated().sum()
0
No duplicated data.
columns_to_remove = ['trip_id' ,'start_lat', 'start_lon', 'end_lat', 'end_lon', 'bike_id']
clean_metro = metro.drop(columns=columns_to_remove, axis=1)
clean_metro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92124 entries, 0 to 92123
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   duration             92124 non-null  int64
 1   start_time           92124 non-null  datetime64[ns]
 2   end_time             92124 non-null  datetime64[ns]
 3   start_station        92124 non-null  object
 4   end_station          92124 non-null  object
 5   plan_duration        92124 non-null  int64
 6   trip_route_category  92124 non-null  object
 7   passholder_type     92124 non-null  object
 8   bike_type            92124 non-null  object
dtypes: datetime64[ns](2), int64(2), object(5)
memory usage: 6.3+ MB
To exclude the testing data, conditionally delete rows where passholder_type is "Testing".
clean_metro = clean_metro[clean_metro['passholder_type'] != "Testing"]
clean_metro["passholder_type"].unique()
array(['Monthly Pass', 'Walk-up', 'Annual Pass', 'One Day Pass',
'Flex Pass'], dtype=object)
clean_metro.isnull().sum()
duration               0
start_time             0
end_time               0
start_station          0
end_station            0
plan_duration          0
trip_route_category    0
passholder_type        0
bike_type              0
dtype: int64
The latitude and longitude attributes, considered unimportant for this analysis, have been removed along with trip_id and bike_id.
# numeric data only
clean_numeric_columns = clean_metro.select_dtypes(include=[np.number])
clean_correlation_matrix = clean_numeric_columns.corr()
clean_correlation_matrix
| duration | plan_duration | |
|---|---|---|
| duration | 1.000000 | 0.041844 |
| plan_duration | 0.041844 | 1.000000 |
A correlation coefficient of 0.041844 indicates a negligible positive linear relationship between duration and plan_duration. Such a weak association should be weighed alongside other variables and analyses before drawing conclusions from it.
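As a side note, with roughly 90,000 rows even a tiny correlation can come out as statistically "significant", so the p-value alone is a poor guide here and the effect size matters more. A minimal sketch on synthetic data (not the metro dataset) illustrates this:

```python
# Sketch on synthetic data: a weak correlation is statistically
# "significant" at large n, yet practically negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 90_000
x = rng.normal(size=n)
y = 0.04 * x + rng.normal(size=n)  # weak true relationship

r, p = stats.pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.2e}")  # r is tiny, but p is far below 0.05
```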
Missing Data
Because the missing latitude and longitude values are station-dependent, they can be recovered from the station metadata the provider publishes in JSON format through its API. If the coordinates are needed for later analysis, the gaps can be filled from that supplementary information.
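Since the missingness is station-dependent, the gaps can even be filled from within the dataset itself: rows for the same station that do have coordinates provide a lookup table. A minimal sketch (demonstrated on made-up station IDs; the column names follow the metro DataFrame):

```python
# Sketch: build a station -> (lat, lon) lookup from rows that do have
# coordinates, then fill the missing values per station.
import pandas as pd
import numpy as np

def fill_station_coords(df, station_col, lat_col, lon_col):
    """Fill missing lat/lon using the first known coordinates per station."""
    lookup = (df.dropna(subset=[lat_col, lon_col])
                .groupby(station_col)[[lat_col, lon_col]]
                .first())
    filled = df.copy()
    for col in (lat_col, lon_col):
        filled[col] = filled[col].fillna(filled[station_col].map(lookup[col]))
    return filled

# Tiny demonstration with made-up station IDs
demo = pd.DataFrame({
    "start_station": ["3005", "3005", "3030"],
    "start_lat": [34.05, np.nan, np.nan],
    "start_lon": [-118.25, np.nan, np.nan],
})
out = fill_station_coords(demo, "start_station", "start_lat", "start_lon")
print(out)
```

Stations that never appear with coordinates (like "3030" above) stay missing and would still need the API metadata.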
Outliers
clean_metro.boxplot(column='duration')
plt.show()
# calculate Inter Quartile Range
Q1 = clean_metro['duration'].quantile(0.25)
Q3 = clean_metro['duration'].quantile(0.75)
IQR = Q3 - Q1
print('Q3: ',Q3)
print('Q1: ',Q1)
print('IQR:',IQR)
Q3: 22.0 Q1: 6.0 IQR: 16.0
# calculate upper and lower limit
upper_limit = Q3 + 1.5*IQR
lower_limit = Q1 - 1.5*IQR
print('upper limit:', upper_limit)
print('lower limit:', lower_limit)
upper limit: 46.0 lower limit: -18.0
clean_metro_no_outlier = clean_metro[clean_metro['duration'] < upper_limit]
Use seaborn to compare the duration distribution before and after removing outliers.
import seaborn as sns
print("Before removing outliers")
plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.histplot(clean_metro['duration'], kde=True)
plt.subplot(2,2,2)
sns.boxplot(data = clean_metro['duration'])
plt.show()
print("After removing outliers")
plt.figure(figsize=(16,8))
plt.subplot(2,2,1)
sns.histplot(clean_metro_no_outlier['duration'], kde=True)
plt.subplot(2,2,2)
sns.boxplot(data = clean_metro_no_outlier['duration'])
plt.show()
Before removing outliers
After removing outliers
Before removing outliers: the duration variable is highly right-skewed, with a long tail caused by a few extremely large values; the boxplot shows many high outliers.
After removing outliers: the distribution is more symmetrical, making trends easier to observe. The boxplot shows fewer outliers and shorter whiskers, so the median and interquartile range are clearer, and summary statistics such as the mean are less distorted by extreme values.
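Before committing to the IQR rule it is also worth quantifying how much data it removes. A small sketch on a synthetic duration series (not the real dataset) shows the kind of check involved:

```python
# Sketch: measure what fraction of rows the 1.5*IQR rule would drop,
# using a synthetic duration series with a few extreme rentals.
import pandas as pd
import numpy as np

rng = np.random.default_rng(42)
durations = pd.Series(np.concatenate([
    rng.integers(1, 30, size=980),      # typical short trips
    rng.integers(200, 1440, size=20),   # a few extreme rentals
]))

q1, q3 = durations.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
kept = durations[durations < upper]
removed_pct = 100 * (1 - len(kept) / len(durations))
print(f"upper limit: {upper}, removed: {removed_pct:.1f}% of rows")
```

In the real data the filter keeps 83,611 of 92,078 post-cleaning rows, i.e. it drops roughly 9% of trips, which is worth stating explicitly when reporting results.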
clean_metro_no_outlier.to_csv('datasets/metro_cleaned.csv', index=False)
dtype_spec = {"start_station": "string","end_station": "string",
"trip_route_category": "string","passholder_type": "string",
"bike_type": "string"}
bicycle = pd.read_csv("datasets/metro_cleaned.csv", dtype = dtype_spec)
bicycle['start_time'] = pd.to_datetime(bicycle['start_time'])
bicycle['end_time'] = pd.to_datetime(bicycle['end_time'])
bicycle.info()
bicycle.head()
bicycle.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 83611 entries, 0 to 83610
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   duration             83611 non-null  int64
 1   start_time           83611 non-null  datetime64[ns]
 2   end_time             83611 non-null  datetime64[ns]
 3   start_station        83611 non-null  string
 4   end_station          83611 non-null  string
 5   plan_duration        83611 non-null  int64
 6   trip_route_category  83611 non-null  string
 7   passholder_type     83611 non-null  string
 8   bike_type            83611 non-null  string
dtypes: datetime64[ns](2), int64(2), string(5)
memory usage: 5.7 MB
| duration | start_time | end_time | plan_duration | |
|---|---|---|---|---|
| count | 83611.000000 | 83611 | 83611 | 83611.000000 |
| mean | 13.111433 | 2019-08-17 17:10:09.504012544 | 2019-08-17 17:23:16.189975040 | 60.837689 |
| min | 1.000000 | 2019-07-01 00:04:00 | 2019-07-01 00:09:00 | 1.000000 |
| 25% | 6.000000 | 2019-07-26 12:36:00 | 2019-07-26 12:46:30 | 30.000000 |
| 50% | 10.000000 | 2019-08-18 19:24:00 | 2019-08-18 19:43:00 | 30.000000 |
| 75% | 19.000000 | 2019-09-09 11:46:00 | 2019-09-09 11:59:30 | 30.000000 |
| max | 45.000000 | 2019-09-30 23:58:00 | 2019-10-01 00:03:00 | 365.000000 |
| std | 9.497429 | NaN | NaN | 108.156262 |
pd.crosstab(bicycle['passholder_type'], bicycle['duration'])
| duration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| passholder_type | |||||||||||||||||||||
| Annual Pass | 246 | 258 | 468 | 560 | 574 | 484 | 376 | 333 | 353 | 313 | ... | 5 | 4 | 6 | 6 | 1 | 0 | 4 | 1 | 0 | 2 |
| Flex Pass | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 2 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Monthly Pass | 1396 | 1776 | 3482 | 4352 | 4454 | 4124 | 3693 | 2988 | 2686 | 2544 | ... | 92 | 79 | 74 | 66 | 53 | 49 | 54 | 40 | 40 | 37 |
| One Day Pass | 219 | 56 | 57 | 64 | 60 | 74 | 87 | 107 | 99 | 115 | ... | 40 | 40 | 41 | 60 | 36 | 32 | 36 | 30 | 26 | 36 |
| Walk-up | 695 | 167 | 227 | 332 | 500 | 631 | 748 | 725 | 733 | 728 | ... | 231 | 204 | 164 | 164 | 189 | 172 | 157 | 144 | 124 | 139 |
5 rows × 45 columns
crosstab = pd.crosstab(bicycle['trip_route_category'], bicycle['passholder_type'])
# visualization
crosstab.plot(kind = "bar", stacked = True, figsize = (6, 4))
plt.title("Trip Route Category vs Passholder Type")
plt.xlabel("Trip Route Category")
plt.ylabel("Number of Trips")
plt.xticks(rotation=0) # Rotate labels to make them readable
plt.show()
Most Monthly Pass holders take one-way trips, possibly because they combine the bike with other transport on a regular commute. "Flex Pass" usage is negligible; in the "Round Trip" category it does not appear at all.
crosstab_bike_type = pd.crosstab(bicycle['bike_type'], bicycle['trip_route_category'])
crosstab_bike_type.plot(kind='bar', stacked=True, figsize=(6, 4))
plt.title('Bike Type vs Trip Route Category')
plt.xlabel('Bike Type')
plt.ylabel('Number of Trips')
plt.xticks(rotation=0) # Rotate labels to make them readable
plt.show()
Standard bicycles are by far the most used. Electrically assisted and smart bicycles see low uptake, which may reflect price or availability influencing users' choices.
bins = [0, 15, 30, float('inf')]
labels = ['Short (<15 min)', 'Medium (15-30 min)', 'Long (>30 min)']
bicycle['duration_group'] = pd.cut(bicycle['duration'], bins=bins, labels=labels)
crosstab_duration = pd.crosstab(bicycle['duration_group'], bicycle['trip_route_category'])
crosstab_duration.plot(kind='bar', stacked=True, figsize=(6, 4))
plt.title('Duration Group vs Trip Route Category')
plt.xlabel('Duration Group')
plt.ylabel('Number of Trips')
plt.xticks(rotation=0)
plt.show()
Binning the durations shows that short trips are far more common than medium and long ones, implying that most users ride the shared bikes for short hops. One-way trips outnumber round trips in all three duration groups, indicating that users mostly ride from one point to another. Long trips are relatively rare, perhaps because bicycles suit short journeys and users choose other transport modes for longer ones.
crosstab_duration_bike = pd.crosstab(bicycle['duration_group'], bicycle['bike_type'])
crosstab_duration_bike.plot(kind='bar', stacked=True, figsize=(6, 4))
plt.title('Duration Group vs Bike Type')
plt.xlabel('Duration Group')
plt.ylabel('Number of Trips')
plt.xticks(rotation=0) # Rotate labels to make them readable
plt.show()
Based on the data, it appears that people tend to use regular bicycles more frequently for short trips as they are easily accessible and affordable. However, for longer journeys, e-bikes with their assisted power are a more desirable option. Bike-sharing system operators can utilize this information to optimize bike allocation and maintenance strategies, and adjust the supply of bikes based on the varying needs for journey duration.
bicycle['start_hour'] = bicycle['start_time'].dt.hour
bicycle['start_day_of_week'] = bicycle['start_time'].dt.day_name()
bicycle.head()
| duration | start_time | end_time | start_station | end_station | plan_duration | trip_route_category | passholder_type | bike_type | duration_group | start_hour | start_day_of_week | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 2019-07-01 00:04:00 | 2019-07-01 00:09:00 | 4312 | 4410 | 30 | One Way | Monthly Pass | standard | Short (<15 min) | 0 | Monday |
| 1 | 9 | 2019-07-01 00:07:00 | 2019-07-01 00:16:00 | 3066 | 3066 | 30 | Round Trip | Monthly Pass | electric | Short (<15 min) | 0 | Monday |
| 2 | 5 | 2019-07-01 00:20:00 | 2019-07-01 00:25:00 | 4410 | 4312 | 30 | One Way | Monthly Pass | electric | Short (<15 min) | 0 | Monday |
| 3 | 20 | 2019-07-01 00:44:00 | 2019-07-01 01:04:00 | 3045 | 4275 | 1 | One Way | Walk-up | standard | Medium (15-30 min) | 0 | Monday |
| 4 | 27 | 2019-07-01 00:44:00 | 2019-07-01 01:11:00 | 3035 | 3049 | 30 | One Way | Monthly Pass | standard | Medium (15-30 min) | 0 | Monday |
Show the distribution of journey duration (Duration) for each day of the week.
sns.violinplot(x='start_day_of_week', y='duration', data=bicycle)
plt.title('Duration Distribution by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Duration (minutes)')
plt.show()
The duration of journeys appears to be quite consistent throughout the week, with a median duration of between 10 and 15 minutes. The daily distribution of journey durations has a similar shape, indicating that users' usage patterns are relatively consistent regardless of the day of the week. The violin plots for Saturday and Sunday are slightly wider than on weekdays, which suggests that journey durations may vary more on weekends, or that users may tend to take slightly longer journeys. The extended tail of the violin plot shows that there are longer journey durations on some days, but these are not necessarily outliers and may simply be less common data points within the normal range.
# passing figsize to plot() sizes the figure correctly; calling
# plt.figure() after plotting only creates an empty extra figure
pd.crosstab(bicycle['start_day_of_week'], bicycle['passholder_type']).plot(kind='bar', stacked=True, figsize=(6, 3))
plt.title('Usage by Passholder Type and Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Number of Trips')
plt.show()
Insights into pass usage behavior are critical for operational planning and marketing strategies. For instance, if a bike-sharing service aims to boost weekend usage, it may need to create promotions that target weekend users. It's worth noting that monthly ticket holders are the most common user group, so service providers may want to focus on enhancing user satisfaction and loyalty among this group.
plt.figure(figsize=(6, 3))
sns.scatterplot(x='start_hour', y='duration', data=bicycle)
plt.title('Duration by Start Hour of Trip')
plt.xlabel('Hour of Day')
plt.ylabel('Duration (minutes)')
plt.show()
The distribution of trip duration is fairly consistent through the day, with no clear rise or fall at any particular time. Most trips last under 30 minutes, which may reflect a system designed, and priced, for short trips. The density of points in the scatter plot hints that more journeys depart during certain hours, pointing to peak periods of activity.
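The "busy hours" hinted at by point density can be made explicit by counting trips per start hour rather than inferring from the scatter plot. A minimal sketch with synthetic hours standing in for bicycle['start_hour']:

```python
# Sketch: count trips per start hour. Synthetic hours are drawn with
# an assumed daytime bias to stand in for the real start_hour column.
import pandas as pd
import numpy as np

rng = np.random.default_rng(1)
weights = np.ones(24)
weights[7:20] = 3.0                      # assume daytime hours are busier
hours = pd.Series(rng.choice(24, size=5000, p=weights / weights.sum()))

hourly = hours.value_counts().sort_index()
peak_hour = hourly.idxmax()
print("peak hour:", peak_hour)
```

With the real data, hourly.plot(kind="bar") would turn this into a histogram of departures by hour.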
bicycle['start_station'].value_counts().head(10).plot(kind='bar', figsize=(6, 3))
plt.title('Top 10 Most Popular Start Stations')
plt.xlabel('Station ID')
plt.ylabel('Number of Trips Started')
plt.show()
The chart shows the number of journeys started at each station, with station IDs on the x-axis. Station 3005 is the most popular starting point, and usage is far from uniform across stations. Understanding these usage patterns matters for maintenance and city planning: operators should ensure high-traffic stations have enough bikes and staff during peak times.
route_counts = bicycle.groupby(['start_station', 'end_station']).size().reset_index(name='count')
pivot_table = route_counts.pivot(index='start_station', columns='end_station', values='count')
plt.figure(figsize=(8,6))
sns.heatmap(pivot_table, cmap="YlGnBu")
plt.title('Heatmap of Route Traffic')
plt.xlabel('End Station')
plt.ylabel('Start Station')
plt.show()
In the heat map, lighter shades indicate fewer journeys and darker shades indicate more journeys between a given pair of stations.
The bike share route traffic heat map helps operators identify popular stations and routes for targeted operational decisions. However, analyzing this data with other factors like location, population density, and transportation methods is crucial for a comprehensive understanding.
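The heatmap identifies hot cells visually; the busiest station pairs can also be ranked numerically from the same route_counts table with nlargest. A minimal sketch on made-up station pairs (with the real data this would be route_counts.nlargest(10, 'count')):

```python
# Sketch: list the busiest routes explicitly instead of reading them
# off the heatmap. Station IDs and counts here are made up.
import pandas as pd

route_counts = pd.DataFrame({
    "start_station": ["3005", "3005", "3030", "3014"],
    "end_station":   ["3014", "3030", "3005", "3005"],
    "count":         [120, 85, 90, 60],
})
top_routes = route_counts.nlargest(3, "count")
print(top_routes)
```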
# plot first, then set labels, so pandas does not overwrite them
bicycle.resample('D', on='start_time').size().plot(figsize=(6, 4))
plt.title('Daily Trips Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Trips')
plt.show()
The daily counts fluctuate with a regular pattern, probably reflecting weekday/weekend differences or other recurring daily activities. The fluctuations are centered around a fairly stable level and exhibit a clear periodicity.
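The suspected weekly cycle can be checked directly by comparing average daily trip counts on weekdays versus weekends. A minimal sketch on a synthetic daily series (standing in for bicycle.resample('D', on='start_time').size(); the weekday/weekend levels are assumed, not taken from the data):

```python
# Sketch: compare mean daily trip counts on weekdays vs weekends.
import pandas as pd
import numpy as np

rng = np.random.default_rng(7)
dates = pd.date_range("2019-07-01", "2019-09-30", freq="D")
base = np.where(dates.dayofweek < 5, 950, 800)   # assumed weekday bias
daily = pd.Series(base + rng.integers(-50, 50, size=len(dates)), index=dates)

weekend = daily.index.dayofweek >= 5
print("weekday mean:", daily[~weekend].mean())
print("weekend mean:", daily[weekend].mean())
```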
# start_time is already datetime, so no re-conversion is needed
bicycle['start_hour'] = bicycle['start_time'].dt.hour
bicycle['day_of_week'] = bicycle['start_time'].dt.day_name()
mean_duration_by_hour = bicycle.groupby('start_hour')['duration'].mean()
mean_duration_by_day = bicycle.groupby('day_of_week')['duration'].mean()
plt.figure(figsize=(6, 3))
sns.barplot(x=mean_duration_by_hour.index, y=mean_duration_by_hour.values,
hue=mean_duration_by_hour.index,
palette='muted',
legend=False)
plt.title('Average Ride Duration by Start Hour')
plt.xlabel('Hour of Day')
plt.ylabel('Average Duration (minutes)')
plt.xticks(range(0, 24))
plt.grid(axis='y')
plt.show()
Average ride duration is longest during the late-night and early-morning hours (roughly 0:00 to 5:00), peaking around midnight and 3:00. It drops as the morning commute begins, then generally rises through the day (9:00 to 19:00), especially in the afternoon. Durations dip slightly in the evening but remain relatively high.
plt.figure(figsize=(6, 3))
ordered_days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# reorder both x and y together so labels stay aligned with their values
sns.barplot(x=ordered_days, y=mean_duration_by_day.reindex(ordered_days).values,
            hue=ordered_days,
            palette='Spectral',
            legend=False)
plt.title('Average Ride Duration by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Duration (minutes)')
plt.grid(axis='y')
plt.show()
On weekends, people tend to ride for a longer time as compared to weekdays, which could be due to leisure activities or travel. On weekdays, the average ride time is shorter, and Tuesday has the shortest average ride time, which may be due to commuting patterns.
The most common pass type is the "Monthly Pass". The "Flex Pass" is relatively uncommon.
bicycle['passholder_type'].value_counts()
passholder_type
Monthly Pass    55962
Walk-up         18061
Annual Pass      6120
One Day Pass     3462
Flex Pass           6
Name: count, dtype: Int64
mean_duration_by_passholder = bicycle.groupby('passholder_type')['duration'].mean()
plt.figure(figsize=(6, 4))
sns.barplot(x=mean_duration_by_passholder.index,
y=mean_duration_by_passholder.values,
hue=mean_duration_by_passholder.index,
palette='viridis',
legend=False)
plt.title('Mean Duration by Passholder Type')
plt.xlabel('Passholder Type')
plt.ylabel('Mean Duration (minutes)')
plt.tight_layout()
plt.show()
The users of "One Day Pass" and "Walk-up" have the longest average riding time. On the other hand, the "Annual Pass" and "Flex Pass" holders have the shortest average riding time.
To compare the averages of different passholder types, it is necessary to perform an ANOVA (Analysis of Variance) test due to having more than two sets of data. However, before conducting such a test, it is crucial to ensure that the data meets the basic assumptions of the ANOVA test, which include normal distribution and homogeneity of variation in each set of data.
from scipy import stats
normality_test_results = {}
for pass_type in bicycle['passholder_type'].unique():
    duration_data = bicycle[bicycle['passholder_type'] == pass_type]['duration']
    normality_test_results[pass_type] = stats.shapiro(duration_data)[1]  # p-value
normality_test_results
/Users/Kate/anaconda3/lib/python3.10/site-packages/scipy/stats/_morestats.py:1882: UserWarning: p-value may not be accurate for N > 5000.
warnings.warn("p-value may not be accurate for N > 5000.")
{'Monthly Pass': 0.0,
'Walk-up': 0.0,
'Annual Pass': 0.0,
'Flex Pass': 0.5543804168701172,
'One Day Pass': 5.135869678977664e-28}
All passholder types, except for "Flex Pass", have p-values that are either close to or equal to 0. This indicates that the ride time data for these pass types does not follow the assumption of a normal distribution. However, it is important to interpret these results with caution because the Shapiro-Wilk test can be too strict when dealing with large sample sizes. As for "Flex Pass", there is only a small amount of data available (only 6 transactions), so the test results may not be as reliable.
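Two ways to sidestep the "p-value may not be accurate for N > 5000" warning are to run Shapiro-Wilk on a random subsample, or to use the D'Agostino-Pearson test (scipy.stats.normaltest), which has no such limit. A minimal sketch on synthetic right-skewed data standing in for a duration column:

```python
# Sketch: normality checks on large samples. Synthetic exponential data
# stands in for a (right-skewed) trip-duration column.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
durations = rng.exponential(scale=12, size=50_000)

sample = rng.choice(durations, size=5000, replace=False)
shapiro_p = stats.shapiro(sample).pvalue          # subsample avoids the warning
normaltest_p = stats.normaltest(durations).pvalue  # no size cap

print(f"Shapiro (subsample): p = {shapiro_p:.3g}")
print(f"D'Agostino-Pearson:  p = {normaltest_p:.3g}")
```

Both tests reject normality here, as expected for skewed data; a Q-Q plot is another option when the tests feel too strict at these sample sizes.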
from scipy import stats
levene_test_result = stats.levene(
    *[bicycle[bicycle['passholder_type'] == pass_type]['duration']
      for pass_type in bicycle['passholder_type'].unique()]
)[1]
levene_test_result
0.0
The p-value is close to 0, indicating that the variance of ride time differs across passholder types, violating the homogeneity-of-variance assumption.
The ANOVA test assumes normal distribution and homogeneity of variations. However, if the data does not fully meet these assumptions, it is recommended to use the Kruskal-Wallis H test instead. This is a nonparametric method that compares medians from multiple independent samples, and it does not require the data to have normal distribution or the same amount of variation.
kruskal_test_result = stats.kruskal(
    *[bicycle[bicycle['passholder_type'] == pass_type]['duration']
      for pass_type in bicycle['passholder_type'].unique()]
)
kruskal_test_result
KruskalResult(statistic=7295.39454547583, pvalue=0.0)
The Kruskal-Wallis H test (statistic 7295.39, p-value close to 0) indicates a statistically significant difference in ride-time distributions among passholder types.
The analysis shows that different types of passholders, classified by their passholder_type, tend to have varying average ride times. This implies that some passholder types may use the bikes for longer or shorter durations. For instance, the analysis suggests that "One Day Pass" and "Walk-up" users have longer average ride times, whereas "Annual Pass" and "Flex Pass" users have shorter average ride times.
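Kruskal-Wallis only says that *some* groups differ; pairwise Mann-Whitney U tests with a Bonferroni correction would indicate which pairs. A minimal sketch with synthetic duration samples standing in for the real passholder groups (group scales are assumed for illustration):

```python
# Sketch: post-hoc pairwise Mann-Whitney U tests with a Bonferroni
# correction. Group data here is synthetic, not from the dataset.
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(5)
groups = {
    "Walk-up":      rng.exponential(18, size=2000),
    "Monthly Pass": rng.exponential(11, size=2000),
    "Annual Pass":  rng.exponential(10, size=2000),
}

pairs = list(combinations(groups, 2))
alpha = 0.05 / len(pairs)                  # Bonferroni-adjusted threshold
for a, b in pairs:
    p = stats.mannwhitneyu(groups[a], groups[b]).pvalue
    print(f"{a} vs {b}: p = {p:.3g}  {'significant' if p < alpha else ''}")
```

Dedicated post-hoc procedures such as Dunn's test (available in third-party packages) are an alternative to the manual correction.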
import pandas as pd
file_path = 'Datasets/seeds.csv'
seeds = pd.read_csv(file_path)
seeds.head()
| | area | perimeter | compactness | length | width | asymmetry | groove_length |
|---|---|---|---|---|---|---|---|
| 0 | 15.26 | 14.84 | 0.871 | 5.763 | 3.312 | 2.221 | 5.220 |
| 1 | 14.88 | 14.57 | 0.881 | 5.554 | 3.333 | 1.018 | 4.956 |
| 2 | 14.29 | 14.09 | 0.905 | 5.291 | 3.337 | 2.699 | 4.825 |
| 3 | 13.84 | 13.94 | 0.895 | 5.324 | 3.379 | 2.259 | 4.805 |
| 4 | 16.14 | 14.99 | 0.903 | 5.658 | 3.562 | 1.355 | 5.175 |
seeds.describe()
| | area | perimeter | compactness | length | width | asymmetry | groove_length |
|---|---|---|---|---|---|---|---|
| count | 210.000000 | 210.000000 | 210.000000 | 210.000000 | 210.000000 | 210.000000 | 210.000000 |
| mean | 14.847524 | 14.559286 | 0.871000 | 5.628533 | 3.258605 | 3.700200 | 5.408071 |
| std | 2.909699 | 1.305959 | 0.023594 | 0.443063 | 0.377714 | 1.503559 | 0.491480 |
| min | 10.590000 | 12.410000 | 0.808000 | 4.899000 | 2.630000 | 0.765000 | 4.519000 |
| 25% | 12.270000 | 13.450000 | 0.857250 | 5.262250 | 2.944000 | 2.561500 | 5.045000 |
| 50% | 14.355000 | 14.320000 | 0.873500 | 5.523500 | 3.237000 | 3.599000 | 5.223000 |
| 75% | 17.305000 | 15.715000 | 0.887750 | 5.979750 | 3.561750 | 4.768750 | 5.877000 |
| max | 21.180000 | 17.250000 | 0.918000 | 6.675000 | 4.033000 | 8.456000 | 6.550000 |
seeds.isnull().sum()
area 0 perimeter 0 compactness 0 length 0 width 0 asymmetry 0 groove_length 0 dtype: int64
import seaborn as sns
sns.set(style="whitegrid")
sns.pairplot(seeds)
plt.show()
There is a clear linear relationship between certain features, such as area and perimeter, length and groove length, while others, such as asymmetry, are more scattered and lack an obvious linear trend.
plt.figure(figsize=(6, 4))
sns.boxplot(data=seeds)
plt.title("Seeds feature boxplot")
plt.xticks(rotation=45)
plt.show()
corr_matrix = seeds.corr()
plt.figure(figsize=(6, 4))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Seed feature correlation coefficient heat map")
plt.show()
It has been observed that certain features, such as perimeter and area, length and groove length, etc., have a strong positive correlation with each other. This implies that these features are closely associated with one another. However, some features are less correlated with each other, such as asymmetry, which is not strongly correlated with other features.
Based on the data exploration and observations above, two candidates for cluster analysis are K-means and the Gaussian Mixture Model (GMM). Here are the reasons these two methods are suitable:
K-means: simple and fast, and well suited to the compact, roughly spherical groupings suggested by the pairplot.
Gaussian Mixture Model (GMM): a probabilistic model that can capture elliptical clusters and overlapping boundaries through soft assignments.
To determine the optimal K value for K-means, one can use the elbow method. This involves calculating the cost of K-means, the sum of squared distances of each point from its cluster centroid (WCSS), over a range of K values. The K at which the rate of decrease slows sharply, the "elbow" of the curve, is taken as the optimal value.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
# data normalization
scaler = StandardScaler()
seeds_scaled = scaler.fit_transform(seeds)
seeds_scaled_df = pd.DataFrame(seeds_scaled, columns=seeds.columns)
wcss = []
for i in range(1, 11):
kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, n_init=10)
kmeans.fit(seeds_scaled)
wcss.append(kmeans.inertia_)
# get the elbow
plt.figure(figsize=(6,4))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Rule')
plt.xlabel('k value')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()
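The elbow can also be located programmatically. One simple heuristic, an assumption of mine rather than part of the original analysis, is to pick the k where the WCSS curve bends most sharply, i.e. where the second difference is largest. A sketch on hypothetical WCSS values:

```python
import numpy as np

# hypothetical WCSS values for k = 1..10, with an elbow around k = 3
wcss = [1470.0, 900.0, 430.0, 380.0, 340.0, 305.0, 275.0, 250.0, 228.0, 210.0]

# second difference measures how sharply the curve bends at each interior k
curvature = np.diff(wcss, 2)             # defined for k = 2..9
elbow_k = int(np.argmax(curvature)) + 2  # +2 maps the index back to a k value
print("elbow at k =", elbow_k)
```

Heuristics like this are best used to confirm, not replace, a visual reading of the plot.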
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_kmeans = kmeans.fit_predict(seeds_scaled)
seeds_clustered = seeds.copy()
seeds_clustered['Cluster'] = y_kmeans
seeds_clustered.head()
| | area | perimeter | compactness | length | width | asymmetry | groove_length | Cluster |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.26 | 14.84 | 0.871 | 5.763 | 3.312 | 2.221 | 5.220 | 2 |
| 1 | 14.88 | 14.57 | 0.881 | 5.554 | 3.333 | 1.018 | 4.956 | 2 |
| 2 | 14.29 | 14.09 | 0.905 | 5.291 | 3.337 | 2.699 | 4.825 | 2 |
| 3 | 13.84 | 13.94 | 0.895 | 5.324 | 3.379 | 2.259 | 4.805 | 2 |
| 4 | 16.14 | 14.99 | 0.903 | 5.658 | 3.562 | 1.355 | 5.175 | 2 |
Visualize the clusters using the two features, area and perimeter.
clusters = y_kmeans  # reuse the labels from the model fitted on the scaled data
plt.figure(figsize=(8, 5))
plt.scatter(seeds['area'], seeds['perimeter'], c=clusters, cmap='viridis', marker='x', alpha=0.7)
plt.title('K-means cluster result (Area vs Perimeter)')
plt.xlabel('Area')
plt.ylabel('Perimeter')
plt.colorbar(label='Cluster')
plt.show()
The dataset is divided into three clusters, each represented by a different color. The data points within each cluster are more closely grouped in terms of area and perimeter, indicating that these two features are significant in clustering. The boundaries between the clusters are distinct, demonstrating that K-means is capable of accurately distinguishing the data. Based on these characteristics, the results of this analysis suggest that K-means clustering is a suitable method for classifying the seed dataset.
from sklearn.decomposition import PCA
# project the standardized data onto two principal components
pca = PCA(n_components=2)
seeds_pca = pca.fit_transform(seeds_scaled)
plt.figure(figsize=(8, 5))
plt.scatter(seeds_pca[:, 0], seeds_pca[:, 1], c=clusters, cmap='viridis', marker='o')
plt.title('K-means cluster result (PCA downgrade the dimension)')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar(label='Cluster')
plt.show()
Each point in this 2D space represents a seed sample, colored by the cluster it belongs to. The clustering structure is well-maintained in the dimensionality reduction process, making it easy to distinguish the three clusters. Principal component analysis effectively captures the primary variability of the data, allowing us to observe the cluster distribution in a 2D space.
The silhouette score ranges from -1 to 1, with higher values indicating better clustering quality.
from sklearn.metrics import silhouette_score
# test k= 2,3,4
k_values = range(2, 5)
silhouette_scores = []
for k in k_values:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(seeds_scaled)
labels = kmeans.labels_
silhouette_scores.append(silhouette_score(seeds_scaled, labels))
k_silhouette_compare = list(zip(k_values, silhouette_scores))
k_silhouette_compare
[(2, 0.46580082738022255), (3, 0.40064436862413266), (4, 0.32634482280657695)]
By the silhouette criterion, k=2 gives the best clustering quality, with a score of 0.466.
For the GMM analysis below, the number of components is nevertheless set to 3, based on the elbow plot and the PCA visualization above.
from sklearn.mixture import GaussianMixture
# data normalization
scaler = StandardScaler()
data_scaled = scaler.fit_transform(seeds)
# choose 3 as cluster number
n_clusters = 3
gmm = GaussianMixture(n_components=n_clusters, random_state=42)
gmm.fit(data_scaled)
# result prediction
labels = gmm.predict(data_scaled)
plt.scatter(data_scaled[:, 0], data_scaled[:, 1], c=labels, cmap='viridis', marker='x')
plt.xlabel('area (standardized)')
plt.ylabel('perimeter (standardized)')
plt.title('GMM Cluster Result')
plt.show()
from sklearn.mixture import GaussianMixture
# data normalization
scaler = StandardScaler()
data_scaled = scaler.fit_transform(seeds)
aic_values = []
bic_values = []
range_of_clusters = range(2, 11)
for n_clusters in range_of_clusters:
gmm = GaussianMixture(n_components=n_clusters, random_state=0)
gmm.fit(data_scaled)
aic_values.append(gmm.aic(data_scaled))
bic_values.append(gmm.bic(data_scaled))
plt.figure(figsize=(6, 4))
plt.plot(range_of_clusters, aic_values, label='AIC', marker='o')
plt.plot(range_of_clusters, bic_values, label='BIC', marker='o')
plt.xlabel('Number of Clusters')
plt.ylabel('Scores')
plt.title('AIC and BIC Scores for Different Number of Clusters')
plt.legend()
plt.grid(True)
plt.show()
This chart suggests that 2 clusters may be the best choice for this set of data, since at this number both AIC and BIC reach relatively low values.
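The visual reading of the AIC/BIC chart can be made explicit by selecting the component count with the lowest BIC. A minimal sketch on synthetic two-blob data, not the seeds values above:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# synthetic data: two well-separated 2-D blobs of 150 points each
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(6, 1, (150, 2))])

candidates = range(1, 7)
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in candidates]
best_k = candidates[int(np.argmin(bics))]
print("lowest BIC at k =", best_k)
```

BIC penalizes extra components more heavily than AIC, so when the two disagree, BIC usually favors the more parsimonious model.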
from sklearn.mixture import GaussianMixture
range_of_clusters = range(2, 5)
silhouette_scores = []
for n_clusters in range_of_clusters:
gmm = GaussianMixture(n_components=n_clusters, random_state=0)
labels = gmm.fit_predict(seeds)  # note: fitted on the raw (unscaled) features here
score = silhouette_score(seeds, labels)
silhouette_scores.append(score)
# evaluate
eval_clusters = range_of_clusters[silhouette_scores.index(max(silhouette_scores))]
eval_clusters, max(silhouette_scores)
(2, 0.48772862024342833)
The silhouette coefficient is approximately 0.488, indicating moderate clustering quality with some separation between clusters, but potential for improvement.
optimal_clusters = 2
gmm = GaussianMixture(n_components=optimal_clusters, random_state=0)
labels = gmm.fit_predict(seeds)
data_with_clusters = seeds.copy()
data_with_clusters['Cluster'] = labels
sns.pairplot(data_with_clusters, hue='Cluster', palette='viridis')
plt.suptitle("GMM Cluster Analysis of Seeds Data", y=1.02)
plt.show()
The results of GMM clustering are represented by different colors for different clusters. If the separation between clusters is clearly visible, it usually indicates good clustering. However, if the boundaries between clusters are blurred, it may indicate that the clustering effect is average.
GMM is preferred for the following reasons:
GMM may perform better than K-means when clusters are not circular or spherical, or when they differ in size and density, since each component has its own covariance structure.
When data set clusters overlap, GMM's soft clustering captures overlaps better than K-Means' hard clustering, which can lead to unclear classification.
Although GMM is computationally more expensive, the dataset here is small enough that training cost is not a concern.
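The soft-clustering point can be made concrete: unlike K-means' hard labels, a fitted GaussianMixture exposes per-component membership probabilities through `predict_proba`. A minimal sketch on synthetic 1-D data with two deliberately overlapping clusters (hypothetical values, not the seeds data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# two overlapping 1-D clusters centered near 0 and 2.5
X = np.concatenate([rng.normal(0, 1, 200),
                    rng.normal(2.5, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# soft assignment: each row sums to 1 across the two components;
# the midpoint lands near 50/50, the extremes near certainty
probs = gmm.predict_proba([[0.0], [1.25], [2.5]])
print(np.round(probs, 2))
```

Points deep inside a cluster get a near-certain assignment, while points in the overlap region receive split probabilities, which is exactly the information hard clustering discards.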
# load data to see the file format
txt_file_path = 'Datasets/seeds_dataset.txt'
with open(txt_file_path, 'r') as file:
seeds_txt_content = file.read()
seeds_txt_content[:500]
'15.26\t14.84\t0.871\t5.763\t3.312\t2.221\t5.22\t1\n14.88\t14.57\t0.8811\t5.554\t3.333\t1.018\t4.956\t1\n14.29\t14.09\t0.905\t5.291\t3.337\t2.699\t4.825\t1\n13.84\t13.94\t0.8955\t5.324\t3.379\t2.259\t4.805\t1\n16.14\t14.99\t0.9034\t5.658\t3.562\t1.355\t5.175\t1\n14.38\t14.21\t0.8951\t5.386\t3.312\t2.462\t4.956\t1\n14.69\t14.49\t0.8799\t5.563\t3.259\t3.586\t5.219\t1\n14.11\t14.1\t0.8911\t5.42\t3.302\t2.7\t\t5\t\t1\n16.63\t15.46\t0.8747\t6.053\t3.465\t2.04\t5.877\t1\n16.44\t15.25\t0.888\t5.884\t3.505\t1.969\t5.533\t1\n15.26\t14.85\t0.8696\t5.714\t3.242\t4.543\t5.314\t1\n14.03\t14.16\t0.87'
The original dataset includes a label field that is not present in the provided dataset. This label field probably indicates the actual type or category of the seed, which is helpful for supervised learning tasks.
# add column names and save new file
file_path = 'Datasets/seeds_dataset.txt'
column_names = ["area", "perimeter", "compactness", "length", "width", "asymmetry", "groove_length", "label"]
seeds_source = pd.read_csv(file_path, sep=r'\s+', header=None, names=column_names)
output_path = 'Datasets/seeds_source.csv'
seeds_source.to_csv(output_path, index=False)
# read the source data
file_path = 'Datasets/seeds_source.csv'
source_seeds = pd.read_csv(file_path)
source_seeds.head()
| | area | perimeter | compactness | length | width | asymmetry | groove_length | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 15.26 | 14.84 | 0.8710 | 5.763 | 3.312 | 2.221 | 5.220 | 1 |
| 1 | 14.88 | 14.57 | 0.8811 | 5.554 | 3.333 | 1.018 | 4.956 | 1 |
| 2 | 14.29 | 14.09 | 0.9050 | 5.291 | 3.337 | 2.699 | 4.825 | 1 |
| 3 | 13.84 | 13.94 | 0.8955 | 5.324 | 3.379 | 2.259 | 4.805 | 1 |
| 4 | 16.14 | 14.99 | 0.9034 | 5.658 | 3.562 | 1.355 | 5.175 | 1 |
The additional label field in the original data set provides a possible true cluster structure and can be used to evaluate the accuracy of the clustering results after the analysis is complete.
Access to the actual labels allows the clustering quality to be assessed by comparing the results with the real labels, using methods such as the confusion matrix, purity, and the Rand index.
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.metrics.cluster import rand_score
# Separate the features and the label
X = source_seeds.drop('label', axis=1)
y_true = source_seeds['label']
n_clusters = len(y_true.unique())  # 3
# standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
y_pred_scaled = kmeans.fit_predict(X_scaled)
# confusion matrix (note: k-means cluster ids 0-2 are arbitrary and need not line up with the true labels 1-3)
conf_matrix_scaled = confusion_matrix(y_true, y_pred_scaled)
# heatmap
plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix_scaled, annot=True, fmt="d", cmap="Blues",
xticklabels=True, yticklabels=True)
plt.title('Confusion Matrix')
plt.ylabel('Actual Label')
plt.xlabel('Predicted Label')
plt.show()
After matching each cluster with its dominant true label, the confusion matrix shows that most points fall into the corresponding cluster, with a small number assigned elsewhere.
rand_index_scaled = rand_score(y_true, y_pred_scaled)
rand_index_scaled
0.8997038049669629
The Rand index is 0.8997, indicating good agreement between clustering results and actual labels.
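As a complement, the adjusted Rand index corrects the plain Rand index for chance agreement and, like it, is indifferent to how the cluster ids are numbered. A small sketch with hand-made labels (not the seeds results above):

```python
from sklearn.metrics import rand_score, adjusted_rand_score

y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3]
# cluster ids are arbitrary: 0/1/2 here correspond to true labels 1/2/3,
# with the last point misassigned
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 1]

print("RI :", round(rand_score(y_true, y_pred), 3))
print("ARI:", round(adjusted_rand_score(y_true, y_pred), 3))
```

The adjusted score is lower than the raw one because random labelings already achieve a nonzero Rand index; an ARI near 0 means no better than chance, and 1 means a perfect match up to relabeling.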
def calculate_purity(conf_matrix):
cluster_max = np.sum(np.amax(conf_matrix, axis=0))
total_points = np.sum(conf_matrix)
purity = cluster_max / total_points
return purity
purity = calculate_purity(conf_matrix_scaled)
purity
0.919047619047619
On its 0-to-1 scale, a purity of 0.9190 indicates that the clustering matches the true labels fairly well. Note, however, that purity ignores the number of clusters: in the degenerate case where every point forms its own cluster, purity is trivially 1, so it should be read alongside other metrics.
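That caveat is easy to demonstrate. The sketch below restates the same purity computation in self-contained form and applies it to toy labels; the degenerate one-cluster-per-point assignment scores a perfect 1.0 despite being useless.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def purity(y_true, y_pred):
    # for each cluster (column), count its dominant true label, then normalize
    cm = confusion_matrix(y_true, y_pred)
    return np.sum(np.amax(cm, axis=0)) / np.sum(cm)

y_true = [0, 0, 0, 1, 1, 1]
sensible = [0, 0, 0, 1, 1, 0]       # one mistake -> purity below 1
one_per_point = [0, 1, 2, 3, 4, 5]  # degenerate: every point its own cluster

print(purity(y_true, sensible))
print(purity(y_true, one_per_point))
```

This is why purity is usually paired with a chance-corrected metric such as the adjusted Rand index.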
Support Vector Machines (SVM) handle nonlinear and high-dimensional data well, using kernel functions to separate data that is not linearly separable.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report
y = source_seeds['label']
# split data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# set SVM with default params
svm_model = SVC(kernel='rbf', C=1.0, gamma='auto')
#svm_model = SVC(kernel='linear', C=100, gamma='scale')
# train
svm_model.fit(X_train_scaled, y_train)
# prediction
y_pred = svm_model.predict(X_test_scaled)
# result
report = classification_report(y_test, y_pred)
print(report)
precision recall f1-score support
1 0.86 0.90 0.88 20
2 1.00 0.95 0.98 21
3 0.91 0.91 0.91 22
accuracy 0.92 63
macro avg 0.92 0.92 0.92 63
weighted avg 0.92 0.92 0.92 63
The output above is the performance report of the SVM classifier.
Precision and recall are high overall; for class 1, for example, the model attains 0.86 precision and 0.90 recall.
The F1 score for class 1 is 0.88, indicating a reasonable balance of precision and recall for that class.
The macro average of 0.92 indicates good overall performance across classes.
The weighted average is also 0.92, showing that the model performs well even after accounting for differences in class sample sizes.
The SVM classifier has shown excellent performance on this dataset, achieving 0.9 or higher on every metric, which indicates a low error rate in classifying the seed categories. Nevertheless, such high scores on a single train/test split can mask overfitting, so the result should be verified with cross-validation.
from sklearn.model_selection import cross_val_score
accuracy_scores = cross_val_score(svm_model, X_scaled, y, cv=5, scoring='accuracy')
print("Accuracy: %0.2f (+/- %0.2f)" % (accuracy_scores.mean(), accuracy_scores.std() * 2))
Accuracy: 0.91 (+/- 0.16)
The cross-validation results indicate that the SVM model performs well on this dataset, with a mean accuracy of 0.91. The reported ±0.16 is two standard deviations (the standard deviation itself is about 0.08), and the interval it defines can be read as a rough confidence band for the model's performance across splits.
The relatively wide confidence interval suggests that different data splits may bring certain changes to the model's performance. This could be due to some degree of heterogeneity within the dataset, or it could be that the dataset itself is small, resulting in large differences on different cross-validation splits.
Overall, the model demonstrates strong classification ability, but the cross-validation variability suggests that we may need to consider using more data to improve its generalization ability.
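One refinement worth noting: in the cross-validation above, the scaler was fit on the full dataset before splitting, which leaks a small amount of information from the held-out folds. Wrapping the scaler and the SVM in a Pipeline refits the scaler inside every fold. The sketch below is self-contained and uses synthetic data that merely mimics the shape of the seeds set (210 samples, 7 features, 3 classes); it is not the actual result.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in for the seeds data
X, y = make_classification(n_samples=210, n_features=7, n_informative=5,
                           n_classes=3, random_state=42)

# the scaler is re-fit on the training portion of every fold, so no
# information from the held-out fold leaks into the scaling
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC(kernel="rbf"))])
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

On a dataset this size the leakage effect is usually tiny, but the pipeline form is the safer default and is already used in the grid search below.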
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
# set the pipeline
pipeline = Pipeline([
('scaler', StandardScaler()),
('svm', SVC())
])
# set the param grid
param_grid = {
'svm__C': [0.1, 1, 10, 100],
'svm__gamma': ['scale', 'auto', 0.01, 0.1, 1],
'svm__kernel': ['rbf', 'poly', 'linear']
}
# init grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X, y)
# result
best_parameters = grid_search.best_params_
best_score = grid_search.best_score_
print(f"Best Parameters: {best_parameters}")
print(f"Best Score: {best_score:.4f}")
Best Parameters: {'svm__C': 100, 'svm__gamma': 'scale', 'svm__kernel': 'linear'}
Best Score: 0.9476
Pros:
Using 'scale' for gamma adjusts its value to the feature variance, which generally generalizes better than a fixed value (though gamma is ignored by the linear kernel that the search ultimately selected).
SVM models that use linear kernels are relatively easy to interpret. The model's decision-making process is easier to understand, and feature weights can be directly viewed to assess feature importance.
Cons:
Although increasing the C value can improve the model's fit to the training data, it can also lead to overfitting, especially when the dataset is small or the number of features is large.
Training SVMs, especially with nonlinear kernels, can be time-consuming for large datasets.
To summarize, the grid search results indicate that SVM with a linear kernel is a suitable choice for this dataset, as it exhibits high accuracy and good interpretability. It is important to monitor the model's performance on new data to prevent overfitting and make necessary adjustments in a timely manner.
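The interpretability point can be illustrated: a fitted linear-kernel SVC exposes one weight per feature through its `coef_` attribute, and larger absolute weights mean more influence on the decision boundary. The sketch below uses synthetic data and hypothetical feature names, not the fitted grid-search model.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic binary problem with four features (names are hypothetical)
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=2, random_state=0)
feature_names = ["area", "perimeter", "length", "width"]

X_scaled = StandardScaler().fit_transform(X)
svm = SVC(kernel="linear", C=100).fit(X_scaled, y)

# one weight vector for a binary problem; sign gives the direction,
# magnitude the influence of each (standardized) feature
for name, w in zip(feature_names, svm.coef_[0]):
    print(f"{name}: {w:+.3f}")
```

Because the features are standardized first, the weight magnitudes are directly comparable across features; for a 3-class problem like the seeds data, `coef_` would instead hold one weight vector per one-vs-one pair.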
SVM is preferred for the following reasons:
SVM can handle nonlinear relationships by using kernel techniques to map data into a higher-dimensional space and find linearly separable boundaries.
For data sets where the sample size is not very large, SVM can usually achieve good performance.
The SVM regularization feature enables users to control overfitting by adjusting the regularization parameter C to maximize the decision boundary margin while minimizing error.
Compared to decision trees or random forests, SVM's margin maximization tends to resist overfitting, especially when there are more features than samples.